{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Second deduplication\n",
    "\n",
    "This notebook begins with **manifestationmeta.tsv** and moves toward a smaller dataset that aspires to contain only one copy of each \"work,\" in [FRBR terminology](https://en.wikipedia.org/wiki/Functional_Requirements_for_Bibliographic_Records).\n",
    "\n",
    "However, this reference to FRBR should not be taken very literally. In reality, we're just identifying (relatively) unique title-author pairs, which may or may not line up with \"works\" and \"expressions.\" I say \"relatively\" because fuzzy matching is used to allow for minor variations in spelling and punctuation.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from difflib import SequenceMatcher\n",
    "from collections import Counter"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create blocks\n",
    "\n",
    "We start by grouping volumes into \"blocks.\" This is purely a time-saving step, to avoid useless comparisons of very different volumes. Each block is identified by the first five characters of the author's name plus the first six characters of the short title, both lowercased. Records with a missing or very short author or title fall back to the placeholder codes 'nan' and 'default'. Volumes that share a block are not necessarily matches, but they are probable matches.\n",
    "\n",
    "This strategy does unfortunately mean that the first few characters of names become very important, which is why I made some effort to standardize naming in the first deduplication notebook -- moving e.g. \"sir\" and \"mrs\" to the end of the name. More could probably be done here: names like \"Du Maurier\" and \"Van Dyck\" are potentially tricky."
   ]
  },
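  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A quick illustration of why the first few characters matter. This is only a sketch: `block_key` is a hypothetical helper (it does not exist elsewhere in this notebook) that restates the slicing used in the next code cell, and the author and title strings are invented examples.\n",
    "\n",
    "```python\n",
    "def block_key(author, title):\n",
    "    # Hypothetical helper: first five characters of the author plus the\n",
    "    # first six of the short title, with placeholders for short values.\n",
    "    if author is None or len(author) < 5:\n",
    "        author = 'nan'\n",
    "    if title is None or len(title) < 6:\n",
    "        title = 'default'\n",
    "    return author[0:5].lower() + title[0:6].lower()\n",
    "\n",
    "# Two spellings of the same author land in different blocks,\n",
    "# so the two records would never be compared.\n",
    "print(block_key('Du Maurier, George', 'Trilby'))\n",
    "print(block_key('DuMaurier, George', 'Trilby'))\n",
    "```\n",
    "\n",
    "The two keys differ, which is exactly the risk with names like \"Du Maurier\": blocking saves time, but a record can only match records whose key it shares."
   ]
  },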
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "meta = pd.read_csv('manifestationmeta.tsv', sep = '\\t', low_memory = False)\n",
    "\n",
    "blocks = dict()\n",
    "for idx in meta.index:\n",
    "    name = meta.loc[idx, 'author']\n",
    "    if pd.isnull(name) or len(name) < 5:\n",
    "        name = 'nan'\n",
    "        \n",
    "    title = meta.loc[idx, 'shorttitle']\n",
    "    \n",
    "    # note that we use short titles, which means that we'll be using\n",
    "    # the titles for individual volume parts when available\n",
    "    \n",
    "    if pd.isnull(title) or len(title) < 6:\n",
    "        title = 'default'\n",
    "    \n",
    "    blockcode = name[0:5].lower() + title[0:6].lower()\n",
    "    if blockcode not in blocks:\n",
    "        blocks[blockcode] = set()\n",
    "    \n",
    "    blocks[blockcode].add(idx)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "89178"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(blocks)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Group records into sets that have the same author / title\n",
    "\n",
    "Generally, the strategy here is to loop through each block, comparing each record to all the other records in the block. If we find suffient similarity, we make sure they end up in the same \"group.\"\n",
    "\n",
    "Records that don't match anything else in the block get their own group.\n",
    "\n",
    "But of course the devil is in the details. For instance, we ignore semicolons and colons, which often substitute for each other in titles. We also cap names at 25 chars and titles at 35 chars, because very long names/titles are often encumbered by extra phrases that do not actually disambiguate anything.\n"
   ]
  },
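  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough sketch of the two ideas above (cleaning, then fuzzy comparison), here is a toy version run on invented titles. `clean` and `similarity` are illustrative names, not part of the pipeline; they restate what the next cell does with `cleanstring` and `probablymatch`, and the 0.75 floor is copied from that cell.\n",
    "\n",
    "```python\n",
    "from difflib import SequenceMatcher\n",
    "\n",
    "def clean(text, cap):\n",
    "    # Illustrative only: drop ; and :, lowercase, truncate to cap characters.\n",
    "    return text.replace(';', '').replace(':', '').lower()[0:cap]\n",
    "\n",
    "def similarity(a, b, floor = 0.75):\n",
    "    m = SequenceMatcher(None, a, b)\n",
    "    score = m.real_quick_ratio()    # cheap upper bound on ratio()\n",
    "    if score > floor:\n",
    "        score = m.ratio()           # exact, slower similarity\n",
    "    return score\n",
    "\n",
    "t1 = clean('Ivanhoe: a romance', 35)\n",
    "t2 = clean('Ivanhoe; a romance', 35)\n",
    "t3 = clean('Ivanhoe, a romance.', 35)\n",
    "\n",
    "print(t1 == t2)              # colon vs. semicolon disappears after cleaning\n",
    "print(similarity(t1, t3))    # other punctuation is left to fuzzy matching\n",
    "```\n",
    "\n",
    "Cleaning handles the colon/semicolon substitution exactly, so fuzzy matching only has to absorb the remaining small differences."
   ]
  },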
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1\n",
      "10001\n",
      "20001\n",
      "30001\n",
      "40001\n",
      "50001\n",
      "60001\n",
      "70001\n",
      "80001\n"
     ]
    }
   ],
   "source": [
    "def probablymatch(str1, str2):\n",
    "    # real_quick_ratio() is a cheap upper bound on ratio(), so we only\n",
    "    # pay for the expensive ratio() when the pair could plausibly match.\n",
    "    m = SequenceMatcher(None, str1, str2)\n",
    "    match = m.real_quick_ratio()\n",
    "    if match > 0.75:\n",
    "        match = m.ratio()\n",
    "    \n",
    "    return match\n",
    "\n",
    "def cleanstring(astring, cap):\n",
    "    # drop semicolons and colons, lowercase, and truncate to cap characters\n",
    "    astring = astring.replace(';', '')\n",
    "    astring = astring.replace(':', '')\n",
    "    astring = astring.lower()\n",
    "    return astring[0 : cap]\n",
    "    \n",
    "groups = []\n",
    "dubiouscalls = []\n",
    "\n",
    "ctr = 0\n",
    "for code, block in blocks.items():\n",
    "    ctr += 1\n",
    "    if ctr % 10000 == 1:\n",
    "        print(ctr)\n",
    "    \n",
    "    already_checked = dict()\n",
    "    titledict = dict()\n",
    "    authdict = dict()\n",
    "    \n",
    "    # we clean all the titles and authors in the block before \n",
    "    # attempting to match; otherwise you end up doing\n",
    "    # n x n cleaning operations.\n",
    "    \n",
    "    for b in block:\n",
    "        auth = meta.loc[b, 'author']\n",
    "        if pd.isnull(auth) or len(auth) < 4:\n",
    "            auth = 'cannot-match'\n",
    "        else:\n",
    "            auth = cleanstring(auth, 25)\n",
    "        \n",
    "        title = meta.loc[b, 'shorttitle']\n",
    "        if pd.isnull(title) or len(title) < 5:\n",
    "            title = 'cannot-match'\n",
    "        else:\n",
    "            title = cleanstring(title, 35)\n",
    "        \n",
    "        titledict[b] = title\n",
    "        authdict[b] = auth\n",
    "           \n",
    "    for b1 in block:\n",
    "        matched = False\n",
    "        for b2 in block:\n",
    "            if b1 == b2:\n",
    "                continue\n",
    "            if (str(b1) + ' ' + str(b2)) in already_checked:\n",
    "                if not matched:\n",
    "                    matched = already_checked[str(b1) + ' ' + str(b2)]\n",
    "                continue\n",
    "            \n",
    "            auth1 = authdict[b1]\n",
    "            auth2 = authdict[b2]\n",
    "            title1 = titledict[b1]\n",
    "            title2 = titledict[b2]\n",
    "            \n",
    "            if auth1 == 'cannot-match' or auth2 == 'cannot-match':\n",
    "                already_checked[str(b2) + ' ' + str(b1)] = False\n",
    "                continue\n",
    "            if title1 == 'cannot-match' or title2 == 'cannot-match':\n",
    "                already_checked[str(b2) + ' ' + str(b1)] = False\n",
    "                continue\n",
    "            \n",
    "            if auth1 == auth2:\n",
    "                authormatch = 1.0\n",
    "            else:\n",
    "                authormatch = probablymatch(auth1, auth2)\n",
    "                if authormatch < 0.9:\n",
    "                    already_checked[str(b2) + ' ' + str(b1)] = False\n",
    "                    continue\n",
    "            \n",
    "            if title1 == title2:\n",
    "                titlematch = 1.0\n",
    "            else:\n",
    "                titlematch = probablymatch(title1, title2)\n",
    "                if titlematch < 0.88:\n",
    "                    already_checked[str(b2) + ' ' + str(b1)] = False\n",
    "                    continue\n",
    "            \n",
    "            if authormatch + titlematch < 1.85:\n",
    "                already_checked[str(b2) + ' ' + str(b1)] = False\n",
    "                continue\n",
    "            elif authormatch + titlematch < 1.91:\n",
    "                outline = auth1 + \" | \" + title1 + '\\n' + auth2 + ' | ' + title2 + '\\n' + str(authormatch + titlematch) + '\\n'\n",
    "                dubiouscalls.append(outline)           \n",
    "            \n",
    "            # we have a match!\n",
    "            matched = True\n",
    "            found = False\n",
    "            for g in groups:\n",
    "                if b1 in g or b2 in g:\n",
    "                    g.add(b1)\n",
    "                    g.add(b2)\n",
    "                    found = True\n",
    "                    break\n",
    "\n",
    "            if not found:\n",
    "                groups.append({b1, b2})\n",
    "            \n",
    "            already_checked[str(b2) + ' ' + str(b1)] = True\n",
    "            \n",
    "        if not matched:\n",
    "            groups.append({b1})\n",
    "                \n",
    "                "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Write dubious calls to a file where they can be inspected\n",
    "\n",
    "The cutoffs above were adjusted manually, and arbitrarily. Matches on the low end, where the author and title similarity scores sum to at least 1.85 but less than 1.91, are collected in a list named `dubiouscalls`. The cells below write that list out.\n",
    "\n",
    "By and large, I'm comfortable with these, though there are a few obvious errors; a couple of Bobbsey Twins books get grouped that shouldn't be grouped, etc."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "16946"
      ]
     },
     "execution_count": 44,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(dubiouscalls)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [],
   "source": [
    "with open('dubiouscalls.txt', mode = 'w', encoding = 'utf-8') as f:\n",
    "    for d in dubiouscalls:\n",
    "        f.write(d)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### A little exploratory description\n",
    "\n",
    "E.g., how many groups do we have? How big is the biggest?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "128296"
      ]
     },
     "execution_count": 46,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(groups)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "317\n"
     ]
    }
   ],
   "source": [
    "# size of the largest group (0 if there are no groups)\n",
    "maxsize = max((len(g) for g in groups), default = 0)\n",
    "print(maxsize)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Now the actual deduplication\n",
    "\n",
    "In principle, we want to take one volume from each group of volumes that have matching titles and authors. And in general we want to take the earliest volume, so that our resulting dataset will be dated as close as possible to dates of first publication.\n",
    "\n",
    "However, there are complicating cases. What if, for instance, the earliest instance of a novel is a Victorian three-decker edition? That's going to happen pretty often. In that case, we don't want to take *just one volume* from the group; we want all three volumes of the earliest edition. So we need a new rule: take all volumes sharing the *recordid* of the earliest volume. That will get all three volumes of a three-volume edition.\n",
    "\n",
    "But we confront yet another complication! Volumes grouped by a recordid are sometimes three volumes of a single work. But often they are, say, 28 volumes in the *Collected Works of Scott*, all sharing a single recordid but not all belonging to the same fictional work. Maybe some of the longer novels are spread across two or three volumes, but many of the volumes represent a single novel. This gets bloody complicated.\n",
    "\n",
    "So our *new* rule is: find the earliest volume. Get its recordid. Find all volumes sharing that recordid (all volumes in the same set). Then take all the volumes whose *short title* closely matches that of the earliest volume (similarity above 0.9). If we have been able to identify vols 11 and 12 as *Ivanhoe*, this will get just 11 and 12. However, if we haven't been able to identify titles beyond *Collected Works of Scott*, we'll get all 28 vols! So the final rule is: ignore cases where we recover more than five vols sharing the same recordid and short title. We suspect these are collected works.\n",
    "\n",
    "As we do this, we want to keep track of the number of copies of a volume that have been collapsed into a single deduplicated record. We'll use a column of \"instances\" created in the earlier stage of deduplication; this counts vols that had the same recordid+volnum. We'll further aggregate that into \"copies\": vols that had the same author/title. Moreover, since we may want to distinguish *contemporary* popularity from later canonicity, we keep track of this in two different ways: a general column of copies, and a column of copies published within 25 years of our first example."
   ]
  },
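  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A toy version of the copy counting, under simplifying assumptions: `count_copies` is a hypothetical helper, the (year, instances) pairs are invented, and unlike the real loop in the next cell it does not handle missing dates or suspicious pre-1700 dates.\n",
    "\n",
    "```python\n",
    "from collections import Counter\n",
    "\n",
    "def count_copies(editions, window = 25):\n",
    "    # editions: (year, instances) pairs for one author/title group\n",
    "    per_year = Counter()\n",
    "    for year, instances in editions:\n",
    "        per_year[year] += instances\n",
    "\n",
    "    earliest = min(per_year)\n",
    "    allcopies = sum(per_year.values())\n",
    "    # copies dated within the window that starts at the earliest example\n",
    "    early = sum(n for year, n in per_year.items() if year < earliest + window)\n",
    "    return earliest, allcopies, early\n",
    "\n",
    "print(count_copies([(1820, 2), (1830, 1), (1900, 5), (1820, 1)]))\n",
    "```\n",
    "\n",
    "Here the group has nine copies in all, but only four of them date from the 25 years beginning in 1820; the 1900 reprints count toward later canonicity, not contemporary popularity."
   ]
  },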
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1\n",
      "10001\n",
      "20001\n",
      "30001\n",
      "40001\n",
      "50001\n",
      "60001\n",
      "70001\n",
      "80001\n",
      "90001\n",
      "100001\n",
      "110001\n",
      "120001\n",
      "0\n"
     ]
    }
   ],
   "source": [
    "selected = []\n",
    "ignored = []\n",
    "errors = 0\n",
    "authtitlecopies = dict()\n",
    "copiesin25yrs = dict()\n",
    "\n",
    "ctr = 0\n",
    "for g in groups:\n",
    "    ctr += 1\n",
    "    if ctr % 10000 == 1:\n",
    "        print(ctr)\n",
    "    \n",
    "    # Some groups contain only a single volume.\n",
    "    if len(g) == 1:\n",
    "        e = next(iter(g))   # the only element of the set\n",
    "        selected.append(e)\n",
    "        authtitlecopies[e] = int(meta.loc[e, 'instances'])\n",
    "        copiesin25yrs[e] = authtitlecopies[e]\n",
    "        # For a single volume, all these quantities will be the same.\n",
    "        continue\n",
    "        \n",
    "    if len(g) < 1:\n",
    "        errors += 1\n",
    "        continue\n",
    "    \n",
    "    earliest = ''\n",
    "    earliestdate = 2100\n",
    "    instancectr = Counter()\n",
    "    \n",
    "    for element in g:\n",
    "        date = meta.loc[element, 'inferreddate']\n",
    "        copies = int(meta.loc[element, 'instances'])\n",
    "        \n",
    "        if pd.isnull(date):\n",
    "            date = 2100\n",
    "        else:\n",
    "            date = int(date)\n",
    "        \n",
    "        instancectr[date] += copies\n",
    "        \n",
    "        if earliestdate == 2100 or date < earliestdate:\n",
    "            earliestdate = date\n",
    "            earliest = element\n",
    "            if earliestdate < 1700:\n",
    "                earliestdate = 2100\n",
    "                # don't reward dubious dates\n",
    "    \n",
    "    # now let's add up those copies\n",
    "    allcopies = 0\n",
    "    copiesin25yrsofearliest = 0\n",
    "    \n",
    "    for date, count in instancectr.items():\n",
    "        allcopies += count\n",
    "        if date < (earliestdate + 25):\n",
    "            copiesin25yrsofearliest += count\n",
    "            \n",
    "    record = meta.loc[earliest, 'recordid']\n",
    "    title2match = str(meta.loc[earliest, 'shorttitle'])\n",
    "\n",
    "    matching = []\n",
    "\n",
    "    thisrec = meta.loc[meta.recordid == record, : ]\n",
    "    for idx in thisrec.index:\n",
    "        thistitle = str(thisrec.loc[idx, 'shorttitle'])\n",
    "        match = probablymatch(title2match, thistitle)\n",
    "        if match > 0.9:\n",
    "            matching.append(idx)\n",
    "    \n",
    "    if len(matching) < 6:\n",
    "        selected.extend(matching)\n",
    "        for m in matching:\n",
    "            authtitlecopies[m] = allcopies\n",
    "            copiesin25yrs[m] = copiesin25yrsofearliest\n",
    "    else:\n",
    "        ignored.append((title2match, record))\n",
    "        \n",
    "print(errors)          "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Some exploratory description\n",
    "\n",
    "For instance, how many records did we select? How many groups of vols were ignored?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "135299\n"
     ]
    }
   ],
   "source": [
    "print(len(selected))\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "255\n"
     ]
    }
   ],
   "source": [
    "print(len(ignored))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('[Complete works', 1417287),\n",
       " ('The life of a lover. In a series of letters', 321626),\n",
       " ('Works, in an English translation', 1364287),\n",
       " ('The novels of Captain Marryat', 9245242),\n",
       " ('Scenes of Parisian life;', 1203519),\n",
       " ('Scenes of private life;', 1203519),\n",
       " ('Scenes of private life;', 1203519),\n",
       " ('Scenes of provincial life;', 1203519),\n",
       " ('Scenes of provincial life;', 1203519),\n",
       " ('Scenes of Parisian life', 7678129),\n",
       " (\"The world's one hundred best short stories\", 6511333),\n",
       " (\"[Scott's novels]\", 8665211),\n",
       " (\"Journeys through Bookland; a new and original plan for reading, applied to the world's best literature for children\",\n",
       "  5543768),\n",
       " ('Complete writings of O. Henry [i.e. W.S. Porter]', 1376739),\n",
       " ('Works', 8881896),\n",
       " ('The works of Louise M?_hlbach in eighteen volumes', 7707100),\n",
       " ('The Riverside readers', 7910637),\n",
       " ('The real America in romance', 9909118),\n",
       " ('Philosophic and analytic studies;', 1203519),\n",
       " ('Philosophic and analytic studies;', 1203519)]"
      ]
     },
     "execution_count": 64,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ignored[0:20]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Let's write the ignored records to file\n",
    "\n",
    "with open('ignoredgroups.tsv', mode = 'w', encoding = 'utf-8') as f:\n",
    "    for title, record in ignored:\n",
    "        f.write(title + '\\t' + str(record) + '\\n')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Now actually produce and write the dataframe\n",
    "\n",
    "All of our effort so far has gone into selecting a list of indices that will be retained. Now we have to use those indices to actually produce a new dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {},
   "outputs": [],
   "source": [
    "# like so\n",
    "\n",
    "deduped = meta.loc[selected, : ]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>docid</th>\n",
       "      <th>oldauthor</th>\n",
       "      <th>author</th>\n",
       "      <th>authordate</th>\n",
       "      <th>inferreddate</th>\n",
       "      <th>latestcomp</th>\n",
       "      <th>datetype</th>\n",
       "      <th>startdate</th>\n",
       "      <th>enddate</th>\n",
       "      <th>imprint</th>\n",
       "      <th>...</th>\n",
       "      <th>locnum</th>\n",
       "      <th>oclc</th>\n",
       "      <th>place</th>\n",
       "      <th>recordid</th>\n",
       "      <th>enumcron</th>\n",
       "      <th>volnum</th>\n",
       "      <th>title</th>\n",
       "      <th>parttitle</th>\n",
       "      <th>shorttitle</th>\n",
       "      <th>instances</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>81024</th>\n",
       "      <td>uva.x000677513</td>\n",
       "      <td>Hope, Laura Lee</td>\n",
       "      <td>Hope, Laura Lee</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1920</td>\n",
       "      <td>1920</td>\n",
       "      <td>s</td>\n",
       "      <td>1920</td>\n",
       "      <td></td>\n",
       "      <td>New York;Grosset &amp; Dunlap;c1920.</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>4153330</td>\n",
       "      <td>nyu</td>\n",
       "      <td>9795572</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>The Bobbsey twins in the great West / | $c: by...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>The Bobbsey twins in the great West</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>82817</th>\n",
       "      <td>nyp.33433082332069</td>\n",
       "      <td>Hope, Laura Lee</td>\n",
       "      <td>Hope, Laura Lee</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1922</td>\n",
       "      <td>1922</td>\n",
       "      <td>s</td>\n",
       "      <td>1922</td>\n",
       "      <td></td>\n",
       "      <td>New York;Grosset and Dunlap;1922</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1410101</td>\n",
       "      <td>nyu</td>\n",
       "      <td>5805688</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>The Bobbsey Twins at the county fair / | $c: b...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>The Bobbsey Twins at the county fair</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>71530</th>\n",
       "      <td>nyp.33433082332028</td>\n",
       "      <td>Hope, Laura Lee</td>\n",
       "      <td>Hope, Laura Lee</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1913</td>\n",
       "      <td>1913</td>\n",
       "      <td>s</td>\n",
       "      <td>1913</td>\n",
       "      <td></td>\n",
       "      <td>New York;Grosset &amp; Dunlap;c1913</td>\n",
       "      <td>...</td>\n",
       "      <td>PZ7.H772Bocs</td>\n",
       "      <td>2568839</td>\n",
       "      <td>nyu</td>\n",
       "      <td>8689225</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>The Bobbsey twins at school, | $c: by Laura Le...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>The Bobbsey twins at school</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>60840</th>\n",
       "      <td>nyp.33433082332010</td>\n",
       "      <td>Hope, Laura Lee</td>\n",
       "      <td>Hope, Laura Lee</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1907</td>\n",
       "      <td>1907</td>\n",
       "      <td>s</td>\n",
       "      <td>1907</td>\n",
       "      <td></td>\n",
       "      <td>New York;Grosset &amp; Dunlap;c1907.</td>\n",
       "      <td>...</td>\n",
       "      <td>PS3515.O585B603 1907</td>\n",
       "      <td>9613473</td>\n",
       "      <td>nyu</td>\n",
       "      <td>5104160</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>The Bobbsey twins at the seashore / | $c: by L...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>The Bobbsey twins at the seashore</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>78570</th>\n",
       "      <td>nyp.33433082344874</td>\n",
       "      <td>Hope, Laura Lee</td>\n",
       "      <td>Hope, Laura Lee</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1919</td>\n",
       "      <td>1919</td>\n",
       "      <td>s</td>\n",
       "      <td>1919</td>\n",
       "      <td></td>\n",
       "      <td>New York;Grosset &amp; Dunlap;1919.</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2109184</td>\n",
       "      <td>nyu</td>\n",
       "      <td>5346603</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>The Bobbsey twins in Washington. / | $c: By La...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>The Bobbsey twins in Washington</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 25 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                    docid        oldauthor           author authordate  \\\n",
       "81024      uva.x000677513  Hope, Laura Lee  Hope, Laura Lee        NaN   \n",
       "82817  nyp.33433082332069  Hope, Laura Lee  Hope, Laura Lee        NaN   \n",
       "71530  nyp.33433082332028  Hope, Laura Lee  Hope, Laura Lee        NaN   \n",
       "60840  nyp.33433082332010  Hope, Laura Lee  Hope, Laura Lee        NaN   \n",
       "78570  nyp.33433082344874  Hope, Laura Lee  Hope, Laura Lee        NaN   \n",
       "\n",
       "       inferreddate  latestcomp datetype startdate enddate  \\\n",
       "81024          1920        1920        s      1920           \n",
       "82817          1922        1922        s      1922           \n",
       "71530          1913        1913        s      1913           \n",
       "60840          1907        1907        s      1907           \n",
       "78570          1919        1919        s      1919           \n",
       "\n",
       "                                imprint    ...                   locnum  \\\n",
       "81024  New York;Grosset & Dunlap;c1920.    ...                      NaN   \n",
       "82817  New York;Grosset and Dunlap;1922    ...                      NaN   \n",
       "71530   New York;Grosset & Dunlap;c1913    ...             PZ7.H772Bocs   \n",
       "60840  New York;Grosset & Dunlap;c1907.    ...     PS3515.O585B603 1907   \n",
       "78570   New York;Grosset & Dunlap;1919.    ...                      NaN   \n",
       "\n",
       "          oclc place recordid enumcron volnum  \\\n",
       "81024  4153330   nyu  9795572      NaN    NaN   \n",
       "82817  1410101   nyu  5805688      NaN    NaN   \n",
       "71530  2568839   nyu  8689225      NaN    NaN   \n",
       "60840  9613473   nyu  5104160      NaN    NaN   \n",
       "78570  2109184   nyu  5346603      NaN    NaN   \n",
       "\n",
       "                                                   title parttitle  \\\n",
       "81024  The Bobbsey twins in the great West / | $c: by...       NaN   \n",
       "82817  The Bobbsey Twins at the county fair / | $c: b...       NaN   \n",
       "71530  The Bobbsey twins at school, | $c: by Laura Le...       NaN   \n",
       "60840  The Bobbsey twins at the seashore / | $c: by L...       NaN   \n",
       "78570  The Bobbsey twins in Washington. / | $c: By La...       NaN   \n",
       "\n",
       "                                 shorttitle instances  \n",
       "81024   The Bobbsey twins in the great West         1  \n",
       "82817  The Bobbsey Twins at the county fair         1  \n",
       "71530           The Bobbsey twins at school         1  \n",
       "60840     The Bobbsey twins at the seashore         1  \n",
       "78570       The Bobbsey twins in Washington         1  \n",
       "\n",
       "[5 rows x 25 columns]"
      ]
     },
     "execution_count": 66,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "deduped.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Add copy counts\n",
    "\n",
    "Before we write out the dataframe, add columns reflecting the number of copies collapsed into each record."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Look up both copy counts for each retained row, using the row's\n",
    "# index label in meta as the dictionary key.\n",
    "deduped = deduped.assign(allcopiesofwork = [authtitlecopies[idx] for idx in deduped.index])\n",
    "deduped = deduped.assign(copiesin25yrs = [copiesin25yrs[idx] for idx in deduped.index])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Index(['docid', 'oldauthor', 'author', 'authordate', 'inferreddate',\n",
      "       'latestcomp', 'datetype', 'startdate', 'enddate', 'imprint',\n",
      "       'imprintdate', 'contents', 'genres', 'subjects', 'geographics',\n",
      "       'locnum', 'oclc', 'place', 'recordid', 'enumcron', 'volnum', 'title',\n",
      "       'parttitle', 'shorttitle', 'instances', 'allcopiesofwork',\n",
      "       'copiesin25yrs'],\n",
      "      dtype='object')\n"
     ]
    }
   ],
   "source": [
    "print(deduped.columns)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "metadata": {},
   "outputs": [],
   "source": [
    "# sort rows\n",
    "deduped.sort_values(by = ['inferreddate', 'recordid', 'volnum'], inplace = True)\n",
    "\n",
    "# put columns in desired order (title last)\n",
    "deduped = deduped[['docid', 'oldauthor', 'author', 'authordate', 'inferreddate',\n",
    "       'latestcomp', 'datetype', 'startdate', 'enddate', 'imprint',\n",
    "       'imprintdate', 'contents', 'genres', 'subjects', 'geographics',\n",
    "       'locnum', 'oclc', 'place', 'recordid', 'instances', 'allcopiesofwork',\n",
    "       'copiesin25yrs', 'enumcron', 'volnum', 'title',\n",
    "       'parttitle', 'shorttitle']]\n",
    "\n",
    "# write to file\n",
    "deduped.to_csv('workmeta.tsv', sep = '\\t', index = False)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
